AF - PaLM-2 and GPT-4 in "Extrapolating GPT-N performance" by Lukas Finnveden
Update: 2023-05-30
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PaLM-2 & GPT-4 in "Extrapolating GPT-N performance", published by Lukas Finnveden on May 30, 2023 on The AI Alignment Forum.
Two and a half years ago, I wrote Extrapolating GPT-N performance, trying to predict how fast scaled-up models would improve on a few benchmarks. One year ago, I added PaLM to the graphs. Another spring has come and gone, and there are new models to add to the graphs: PaLM-2 and GPT-4. (Though I only know GPT-4's performance on a small handful of benchmarks.)
Converting to Chinchilla scaling laws
In previous iterations of the graph, the x-position represented the loss on GPT-3's validation set, and the x-axis was annotated with estimates of size+data that you'd need to achieve that loss according to the Kaplan scaling laws. (When adding PaLM to the graph, I estimated its loss using those same Kaplan scaling laws.)
In these new iterations, the x-position instead represents an estimate of (reducible) loss according to the Chinchilla scaling laws. Even without adding any new data points, this predicts faster progress, since the Chinchilla scaling laws describe how to get better performance for less compute.
The appendix describes how I estimate Chinchilla reducible loss for GPT-3 and PaLM-1. Briefly: for the GPT-3 data points, I convert the loss reported in the GPT-3 paper into the minimum number of parameters and tokens you'd need to achieve that loss according to the Kaplan scaling laws, and then plug those parameter and token counts into the Chinchilla loss function. For PaLM-1, I straightforwardly put its parameter and token counts into the Chinchilla loss function.
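For concreteness, here is a minimal sketch of that last step, assuming the loss function and fitted constants from the Chinchilla paper's third approach (Hoffmann et al., 2022): L(N, D) = E + A/N^alpha + B/D^beta with E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, alpha ≈ 0.34, beta ≈ 0.28. The post's exact constants may differ slightly; PaLM-1's parameter and token counts are public.

```python
# Sketch of the Chinchilla loss function, using the fitted constants
# from Hoffmann et al. (2022), approach 3. The post's exact constants
# may differ slightly.
A, B = 406.4, 410.7        # coefficients of the parameter and data terms
ALPHA, BETA = 0.34, 0.28   # exponents; E ~ 1.69 is the irreducible part

def chinchilla_reducible_loss(n_params: float, n_tokens: float) -> float:
    """Reducible loss A/N^alpha + B/D^beta in nats/token (excludes E)."""
    return A / n_params**ALPHA + B / n_tokens**BETA

# PaLM-1: ~540B parameters trained on ~780B tokens.
print(chinchilla_reducible_loss(540e9, 780e9))  # ~0.23 nats/token
```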
To start off, let's look at a graph with only GPT-3 and PaLM-1, with a Chinchilla x-axis.
Here's a quick explainer of how to read the graphs (the original post contains more details). Each dot represents a particular model’s performance on a particular category of benchmarks (taken from papers about GPT-3 and PaLM). Color represents benchmark; y-position represents benchmark performance (normalized between random and my guess of maximum possible performance).
The x-axis labels are all using the Chinchilla scaling laws to predict reducible loss-per-token, number of parameters, number of tokens, and total FLOP (if language models at that loss were trained Chinchilla-optimally).
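One way to derive such labels (a sketch under the same assumed constants as above, not necessarily the post's exact procedure): for a fixed compute budget C = 6*N*D, the loss is minimized when alpha*A/N^alpha = beta*B/D^beta, so a target reducible loss splits between the two terms in a fixed ratio and pins down N, D, and FLOP in closed form.

```python
# Sketch: invert the Chinchilla loss function to get the axis labels.
# At the compute optimum (C = 6*N*D fixed), alpha*A/N^alpha = beta*B/D^beta,
# so a target reducible loss splits into the two terms in a fixed ratio.
A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_optimal(reducible_loss: float):
    u = reducible_loss * BETA / (ALPHA + BETA)   # parameter term A/N^alpha
    v = reducible_loss * ALPHA / (ALPHA + BETA)  # data term B/D^beta
    n_params = (A / u) ** (1 / ALPHA)
    n_tokens = (B / v) ** (1 / BETA)
    return n_params, n_tokens, 6 * n_params * n_tokens

n, d, c = chinchilla_optimal(0.234)  # PaLM-1's estimated reducible loss
print(f"N = {n:.2e}, D = {d:.2e}, FLOP = {c:.2e}")
# -> N ~ 3.5e10, D ~ 3.3e12, FLOP ~ 7e23
```

Under these constants, PaLM-1's actual training compute (6 × 540e9 × 780e9 ≈ 2.5e24 FLOP) sits roughly 0.5 OOM above the Chinchilla-optimal compute for the same estimated loss, consistent with the shift noted below.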
Compare to the last graph in this comment, which is the same with a Kaplan x-axis. Some things worth noting:
PaLM now sits ~0.5 OOM of compute earlier on the x-axis. This corresponds to the fact that you could get PaLM-level performance for less compute if you used optimal parameter- and data-scaling.
The smaller GPT-3 models are farther to the right on the x-axis. I think this is mainly because the x-axis in my previous post had a different interpretation.
The overall effect is that the data points get compressed together, and the slope becomes steeper. Previously, the black "Average" sigmoid reached 90% at ~1e28 FLOP. Now it looks like it reaches 90% at ~5e26 FLOP.
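For readers who want to reproduce this kind of extrapolation curve, here is a hypothetical sketch of the fitting step: a sigmoid of normalized benchmark performance against reducible loss. The data points and exact functional form below are placeholders, not the post's.

```python
# Hypothetical sketch of the extrapolation step: fit a sigmoid of
# normalized benchmark performance against reducible loss. The data
# points and functional form are placeholders, not the post's.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(loss, midpoint, slope):
    # Performance rises from 0 toward 1 as reducible loss falls.
    return 1.0 / (1.0 + np.exp(slope * (loss - midpoint)))

losses = np.array([0.60, 0.45, 0.35, 0.28, 0.23])  # hypothetical x-values
scores = np.array([0.10, 0.25, 0.45, 0.60, 0.72])  # hypothetical y-values

(midpoint, slope), _ = curve_fit(sigmoid, losses, scores, p0=[0.4, 10.0])
print(sigmoid(0.15, midpoint, slope))  # extrapolated performance at lower loss
```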
Let's move on to PaLM-2. If you want to guess whether PaLM-2 and GPT-4 will underperform or outperform extrapolations, now might be a good time to think about that.
PaLM-2
If this CNBC leak is to be trusted, PaLM-2 uses 340B parameters and is trained on 3.6T tokens. That's more parameters and fewer tokens than the Chinchilla scaling laws recommend (there's a quick arithmetic check after this list). Possible explanations include:
The model isn't dense. Perhaps it uses some type of mixture-of-experts setup, which would make its effective parameter count smaller.
It's trained Chinchilla-optimally for multiple epochs on a 3.6T token dataset.
The leak is wrong.
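Whatever the explanation, the mismatch is easy to quantify with the common ~20-tokens-per-parameter rule of thumb (an approximation to the Chinchilla-optimal ratio):

```python
# Quick check of the leaked PaLM-2 figures against the ~20 tokens per
# parameter rule of thumb (an approximation to the Chinchilla-optimal
# ratio, which shifts somewhat with scale).
n_params, n_tokens = 340e9, 3.6e12   # leaked figures

print(n_tokens / n_params)       # ~10.6 tokens/param, vs ~20 "optimal"
print(20 * n_params / 1e12)      # 340B params would "want" ~6.8T tokens
print(n_tokens / 20 / 1e9)       # 3.6T tokens would "want" ~180B params

# FLOP comparison (C = 6*N*D):
print(6 * n_params * n_tokens)   # actual: ~7.3e24 FLOP
print(6 * 180e9 * n_tokens)      # 180B-param Chinchilla run: ~3.9e24 FLOP
```

The 180B-parameter run in the last line is the lower bound considered next.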
If we assume that the leak isn't too wrong, I think that fairly safe bounds for PaLM-2's Chinchilla-equivalent compute are:
It's as good as a dense Chinchilla-optimal model trained on just 3.6T tokens, i.e. one with 3.6T/20=180B parameters. This would ...